model weight
- North America > United States > Virginia (0.04)
- North America > United States > Pennsylvania (0.04)
- North America > United States > California > Santa Clara County > San Jose (0.04)
- Media (0.34)
- Leisure & Entertainment (0.34)
- Health & Medicine > Therapeutic Area > Neurology (1.00)
- Health & Medicine > Health Care Technology (0.72)
Supplementary material: Ensembling geophysical models with Bayesian Neural Networks
This is based on work from Knutti et al. The heteroscedastic loss function is prone to episodes of catastrophic forgetting.
Values are given as synthetic experiment / ozone experiment:
- Spatial coord scaling: 2 / 2
- Temporal coord scaling (month of year): 1 / 2
- Temporal coord scaling (total months): 1 / 1
- Number of physical models: 4 / 15
- Number of neural network ensemble members: 50 / 65
- Bias mean.
- Noise mean prior: 0.02 / 0.015
In the following, we derive the anchored ensembling loss function for the heteroscedastic case.
Ensembling geophysical models with Bayesian Neural Networks
Ensembles of geophysical models improve prediction accuracy and express uncertainties. We develop a novel data-driven ensembling strategy for combining geophysical models using Bayesian Neural Networks, which infers spatiotemporally varying model weights and bias, while accounting for heteroscedastic uncertainties in the observations. This produces more accurate and uncertainty-aware predictions without sacrificing interpretability.
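The combination described in this abstract can be illustrated with a minimal sketch. It assumes, as a simplification not stated in the abstract, that the per-point model weights come from a softmax over logits and that the bias is additive; in the paper these quantities are inferred by a Bayesian Neural Network, whereas here they are plain arrays. The function names `ensemble_prediction` and `gaussian_nll` are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def ensemble_prediction(model_preds, weight_logits, bias):
    """Combine physical-model outputs with point-wise weights and bias.

    model_preds:   (n_points, n_models) predictions of each physical model
    weight_logits: (n_points, n_models) unnormalised weights per point (assumed form)
    bias:          (n_points,) additive bias per point
    """
    weights = softmax(weight_logits, axis=1)  # each row sums to 1
    return (weights * model_preds).sum(axis=1) + bias, weights

def gaussian_nll(y, mu, sigma):
    """Heteroscedastic Gaussian negative log-likelihood, one sigma per point."""
    return 0.5 * np.log(2 * np.pi * sigma**2) + (y - mu) ** 2 / (2 * sigma**2)

# Three space-time points, two physical models (toy numbers).
model_preds = np.array([[1.0, 3.0], [2.0, 2.0], [0.0, 4.0]])
weight_logits = np.zeros((3, 2))
weight_logits[2] = [0.0, 50.0]  # third point trusts the second model almost fully
bias = np.array([0.0, 0.5, 0.0])

pred, weights = ensemble_prediction(model_preds, weight_logits, bias)
nll = gaussian_nll(np.array([2.0, 2.5, 4.0]), pred, np.array([0.1, 1.0, 0.5]))
```

Because the weights are explicit and sum to one at every point, they can be read directly as the relative trust placed in each physical model at that location and time.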
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- North America > Saint Martin (0.04)
- Europe > United Kingdom > England > Lancashire > Lancaster (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.68)
EMR-Merging: Tuning-Free High-Performance Model Merging
The success of the pretrain-finetune paradigm has led to the release of numerous model weights. In this case, merging models finetuned on different tasks to obtain a single model with multi-task capabilities is gaining increasing attention for its practicality. Existing model merging methods usually suffer from (1) significant performance degradation or (2) requiring tuning by additional data or training. In this paper, we rethink and analyze the existing model merging paradigm. We discover that using a single model's weights can hardly simulate all the models' performance. To tackle this issue, we propose Elect, Mask & Rescale-Merging (EMR-Merging). We first (a) elect a unified model from all the model weights and then (b) generate extremely lightweight task-specific modulators, including masks and rescalers, to align the direction and magnitude between the unified model and each specific model, respectively. EMR-Merging is tuning-free, thus requiring no data availability or any additional training while showing impressive performance. We find that EMR-Merging shows outstanding performance compared to existing merging methods under different classical and newly-established settings, including merging different numbers of vision models (up to 30), NLP models, PEFT models, and multi-modal models.
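The elect, mask, and rescale steps can be sketched on flat weight vectors. The abstract does not give the exact rules, so the ones below are assumptions made for illustration: the unified sign is the sign of the summed task vectors, the unified magnitude is the largest agreeing magnitude, the mask keeps entries whose sign matches the task, and the rescaler matches the total absolute magnitude. The names `emr_merge` and `reconstruct` are hypothetical.

```python
import numpy as np

def emr_merge(pretrained, finetuned):
    """Sketch of Elect, Mask & Rescale on flat weight vectors (assumed rules)."""
    # Task vectors: difference between each finetuned model and the base.
    tvs = np.stack([w - pretrained for w in finetuned])  # (n_tasks, n_params)

    # Elect: per-parameter sign from the summed task vectors, magnitude from
    # the largest entry among the task vectors that agree with that sign.
    sign = np.sign(tvs.sum(axis=0))
    agree = (np.sign(tvs) == sign) & (sign != 0)
    unified = sign * np.where(agree, np.abs(tvs), 0.0).max(axis=0)

    # Mask: keep unified entries that point in the same direction as the task.
    masks = (tvs * unified) > 0

    # Rescale: one scalar per task so total magnitudes match the task vector.
    denom = np.abs(masks * unified).sum(axis=1)
    scales = np.abs(tvs).sum(axis=1) / np.where(denom > 0, denom, 1.0)
    return unified, masks, scales

def reconstruct(pretrained, unified, mask, scale):
    """Task-specific weights from the unified vector and its modulators."""
    return pretrained + scale * mask * unified

pretrained = np.zeros(4)
finetuned = [np.array([1.0, -2.0, 0.5, 0.0]), np.array([3.0, 1.0, 0.5, -1.0])]
unified, masks, scales = emr_merge(pretrained, finetuned)
task_a = reconstruct(pretrained, unified, masks[0], scales[0])
```

Only one boolean mask and one scalar are stored per task on top of the shared unified vector, which is what makes the modulators lightweight.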
Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time
Large language models (LLMs) have sparked a new wave of exciting AI applications. Hosting these models at scale requires significant memory resources. One crucial memory bottleneck for deployment stems from the context window. It is commonly recognized that model weights are memory hungry; however, the size of the key-value embeddings stored during the generation process (KV cache) can easily surpass the model size. The enormous size of the KV cache constrains the inference batch size, which is crucial for high-throughput inference workloads.
Verifying LLM Inference to Detect Model Weight Exfiltration
Rinberg, Roy, Karvonen, Adam, Hoover, Alexander, Reuter, Daniel, Warr, Keri
As large AI models become increasingly valuable assets, the risk of model weight exfiltration from inference servers grows accordingly. An attacker controlling an inference server may exfiltrate model weights by hiding them within ordinary model outputs, a strategy known as steganography. This work investigates how to verify model responses to defend against such attacks and, more broadly, to detect anomalous or buggy behavior during inference. We formalize model exfiltration as a security game, propose a verification framework that can provably mitigate steganographic exfiltration, and specify the trust assumptions associated with our scheme. To enable verification, we characterize valid sources of non-determinism in large language model inference and introduce two practical estimators for them. We evaluate our detection framework on several open-weight models ranging from 3B to 30B parameters. On MOE-Qwen-30B, our detector reduces exfiltratable information to <0.5% with a false-positive rate of 0.01%, corresponding to a >200x slowdown for adversaries. Overall, this work further establishes a foundation for defending against model weight exfiltration and demonstrates that strong protection can be achieved with minimal additional cost to inference providers.
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > California > Los Angeles County > Santa Monica (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
KVNAND: Efficient On-Device Large Language Model Inference Using DRAM-Free In-Flash Computing
Deng, Lishuo, Xu, Shaojie, Chen, Jinwu, Yan, Changwei, Wang, Jiajie, Jiang, Zhe, Shan, Weiwei
Deploying large language models (LLMs) on edge devices enables personalized agents with strong privacy and low cost. However, with tens to hundreds of billions of parameters, single-batch autoregressive inference suffers from extremely low arithmetic intensity, creating severe weight-loading and bandwidth pressures on resource-constrained platforms. Recent in-flash computing (IFC) solutions alleviate this bottleneck by co-locating weight-related linear computations in the decode phase with flash, yet still rely on DRAM for the key-value (KV) cache. As context length grows, the KV cache can exceed model weights in size, imposing prohibitive DRAM cost and capacity requirements. Attempts to offload KV cache to flash suffer from severe performance penalties. We propose KVNAND, the first DRAM-free, IFC-based architecture that stores both model weights and KV cache entirely in compute-enabled 3D NAND flash. KVNAND addresses the fundamental performance challenges of flash under intensive KV cache access by leveraging IFC for all memory-bound operations to reduce data transfer overhead, introducing head-group parallelism to boost throughput, and employing page-level KV cache mapping to align token access patterns with flash organization. In addition, we propose a design space exploration framework that evaluates discrete and compact KVNAND variants to balance weight and KV placement, automatically identifying the optimal design trade-off. These techniques mitigate latency, energy, and reliability concerns, turning flash into a practical medium for long-context KV storage. Evaluations on MHA 7B and GQA 70B LLMs show that KVNAND achieves 1.98×/1.94×/2.05× geomean speedup at 128/1K/10K-token contexts compared to DRAM-equipped IFC designs and addresses out-of-memory failures at 100K context length.
As Large Language Models (LLMs) integrate into daily workflows, demand increases for personalized AI agents that align with user preferences, domain knowledge, and interaction styles. Deploying such agents on edge devices offers privacy, low-latency responsiveness, and cost efficiency by eliminating cloud dependency, making on-device LLMs a compelling direction for AI democratization [81]. Realizing high-quality personal LLM agents on resource-limited edge devices faces two main bottlenecks: memory capacity and bandwidth. The growing demand for long-context agentic workflows like long document analysis [35], multi-turn dialogue [84], and chain-of-thought reasoning [10] introduces the KV cache as another dominant consumer of this limited memory [19], [74]. Moreover, recent state-of-the-art (SoTA) models support extensive context lengths ranging from 128K (LLaMA3.1-70B The KV cache demand scales linearly with context length; for example, a 13B model already requires 8 GB KV memory at a 10K context [71], placing prohibitive pressure on edge resources.
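The linear scaling and the 8 GB figure can be checked with a back-of-the-envelope calculation. The sketch below assumes standard multi-head attention (one key and one value vector of the full hidden size per layer per token), 16-bit values, and a LLaMA-13B-like shape of 40 layers with hidden size 5120; none of these numbers appear in the text above, and the function name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(n_layers, hidden_size, context_len,
                   bytes_per_value=2, batch_size=1):
    """Approximate KV cache size for multi-head attention.

    The leading factor 2 accounts for storing both keys and values.
    Assumes the number of KV heads equals the number of attention heads;
    grouped-query attention would shrink the result proportionally.
    """
    return 2 * n_layers * hidden_size * context_len * bytes_per_value * batch_size

# Assumed 13B-class configuration: 40 layers, hidden size 5120, fp16 values.
size_10k = kv_cache_bytes(n_layers=40, hidden_size=5120, context_len=10_000)
size_20k = kv_cache_bytes(n_layers=40, hidden_size=5120, context_len=20_000)

print(f"{size_10k / 1e9:.2f} GB at 10K tokens")  # prints "8.19 GB at 10K tokens"
print(f"{size_20k / 1e9:.2f} GB at 20K tokens")  # prints "16.38 GB at 20K tokens"
```

Under these assumptions each token costs about 0.8 MB of cache, so doubling the context doubles the footprint while the weight memory stays fixed.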
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- North America > United States > California > San Diego County > La Jolla (0.04)
- Europe > Netherlands > South Holland > Rotterdam (0.04)
iSeal: Encrypted Fingerprinting for Reliable LLM Ownership Verification
Xiong, Zixun, Wu, Gaoyi, Yu, Qingyang, Ma, Mingyu Derek, Yao, Lingfeng, Pan, Miao, Du, Xiaojiang, Wang, Hao
Given the high cost of large language model (LLM) training from scratch, safeguarding LLM intellectual property (IP) has become increasingly crucial. As the standard paradigm for IP ownership verification, LLM fingerprinting thus plays a vital role in addressing this challenge. Existing LLM fingerprinting methods verify ownership by extracting or injecting model-specific features. However, they overlook potential attacks during the verification process, leaving them ineffective when the model thief fully controls the LLM's inference process. In such settings, attackers may share prompt-response pairs to enable fingerprint unlearning, or manipulate outputs to evade exact-match verification. We propose iSeal, the first fingerprinting method designed for reliable verification when the model thief controls the suspected LLM in an end-to-end manner. It injects unique features into both the model and an external module, reinforced by an error-correction mechanism and a similarity-based verification strategy. These components are resistant to verification-time attacks, including collusion-based fingerprint unlearning and response manipulation, backed by both theoretical analysis and empirical results.
- North America > United States > Texas > Harris County > Houston (0.04)
- Asia > China (0.04)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Government > Regional Government > North America Government > United States Government (0.46)